Segment and Combine Approach for Non-parametric Time-Series Classification

Authors

  • Pierre Geurts
  • Louis Wehenkel
Abstract

This paper presents a novel, generic, scalable, autonomous, and flexible supervised learning algorithm for the classification of multivariate and variable-length time series. The essential ingredients of the algorithm are randomization, segmentation of time-series, decision tree ensemble based learning of subseries classifiers, combination of subseries classifications by voting, and cross-validation based adaptation of the temporal resolution. Experiments are carried out with this method on 10 synthetic and real-world datasets. They highlight the good behavior of the algorithm on a large diversity of problems. Our results are also highly competitive with existing approaches from the literature.

1 Learning to classify time-series

Time-series classification is an important problem because of its many applications. Specific applications concern the non-intrusive monitoring and diagnosis of processes and biological systems, for example to decide whether the system is in a healthy operating condition on the basis of measurements of various signals. Other relevant applications concern speech recognition and behavior analysis, in particular biometrics and fraud detection.

From the viewpoint of machine learning, a time-series classification problem is basically a supervised learning problem with temporally structured input variables. Among the practical problems faced when applying classical (propositional) learning algorithms to this class of problems, the main one is to transform the non-standard input representation into a fixed number of scalar attributes that can be managed by a propositional base learner while retaining information about the temporal properties of the original data.
One approach to solving this problem is to define a (possibly very large) collection of temporal predicates which can be applied to each time-series in order to compute (logical or numerical) features, which can then be used as the input representation for any base learner (e.g. [7, 8, 10, 11]). This feature extraction step can also be incorporated directly into the learning algorithm [1, 2, 14]. Another approach is to define a distance or similarity measure between time-series that takes into account temporal peculiarities (e.g. invariance with respect to time or amplitude rescaling) and then to use this distance measure in combination with nearest neighbors or other kernel-based methods [12, 13]. A potential advantage of these approaches is the possibility of biasing the representation by exploiting prior problem-specific knowledge. At the same time, this problem-specific modeling step makes the application of machine learning non-autonomous.

The approach investigated in this paper aims at developing a fully generic and off-the-shelf time-series classification method. More precisely, the proposed algorithm relies on a generic pre-processing stage which extracts from the time-series a number of randomly selected subseries, all of the same length, which are labeled with the class of the time-series from which they were taken. Then a generic supervised learning method is applied to the sample of subseries, so as to derive a subseries classifier. Finally, a new time-series is classified by aggregating the predictions of all its subseries of that length. The method is combined with a ten-fold cross-validation wrapper in order to adjust the size of the subseries automatically to a given dataset. As base learners, we use tree-based methods because of their scalability and autonomy.
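The pipeline described above (random subseries extraction, subseries classifier, voting, cross-validated choice of subseries length) can be illustrated with a minimal Python sketch. It is not the authors' implementation. To stay self-contained it swaps in a simple nearest-centroid classifier where the paper uses tree-based ensembles, and it assumes every candidate length is no longer than the shortest series. All function names, the toy dataset, and the parameters `n_sub` and `candidates` are hypothetical.

```python
import random
from collections import Counter

def window(series, start, length):
    """Flatten `length` consecutive time steps into one fixed-size attribute vector."""
    return [v for step in series[start:start + length] for v in step]

def build_subseries_sample(dataset, length, n_sub, rng):
    """Draw n_sub random subseries, each labeled with the class of its parent series."""
    sample = []
    for _ in range(n_sub):
        series, label = rng.choice(dataset)
        start = rng.randrange(len(series) - length + 1)  # assumes length <= len(series)
        sample.append((window(series, start, length), label))
    return sample

class NearestCentroid:
    """Stand-in base learner (assumption: the paper uses tree-based ensembles instead)."""
    def fit(self, sample):
        sums, counts = {}, Counter()
        for x, y in sample:
            acc = sums.setdefault(y, [0.0] * len(x))
            for i, v in enumerate(x):
                acc[i] += v
            counts[y] += 1
        self.centroids = {y: [s / counts[y] for s in acc] for y, acc in sums.items()}
        return self

    def predict(self, x):
        def sq_dist(y):
            return sum((a - b) ** 2 for a, b in zip(x, self.centroids[y]))
        return min(self.centroids, key=sq_dist)

def classify_series(model, series, length):
    """Classify a whole series by majority vote over all its subseries of `length`."""
    votes = Counter(model.predict(window(series, s, length))
                    for s in range(len(series) - length + 1))
    return votes.most_common(1)[0][0]

def cv_error(dataset, length, n_sub, rng, folds=10):
    """k-fold cross-validation error rate for one candidate subseries length."""
    errors = 0
    for k in range(folds):
        test = dataset[k::folds]
        train = [d for i, d in enumerate(dataset) if i % folds != k]
        model = NearestCentroid().fit(build_subseries_sample(train, length, n_sub, rng))
        errors += sum(classify_series(model, s, length) != y for s, y in test)
    return errors / len(dataset)

def choose_length(dataset, candidates, n_sub, rng, folds=10):
    """Pick the candidate length with the lowest cross-validation error."""
    return min(candidates, key=lambda length: cv_error(dataset, length, n_sub, rng, folds))

def make_toy_dataset(rng, n_per_class=10):
    """Hypothetical data: two temporal attributes, variable duration, two well-separated classes."""
    data = []
    for _ in range(n_per_class):
        for label, offset in (("low", 0.0), ("high", 4.0)):
            duration = rng.randint(8, 15)
            steps = [(offset + rng.random(), offset + rng.random()) for _ in range(duration)]
            data.append((steps, label))
    return data

if __name__ == "__main__":
    rng = random.Random(0)
    data = make_toy_dataset(rng)
    best = choose_length(data, candidates=[2, 4, 8], n_sub=200, rng=rng)
    model = NearestCentroid().fit(build_subseries_sample(data, best, 200, rng))
    print(best, classify_series(model, data[0][0], best))
```

Because each subseries is flattened into a vector of fixed size, any propositional learner with `fit` and `predict` methods of this shape could replace the stand-in classifier without changing the rest of the sketch.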
Section 2 presents and motivates the proposed algorithmic framework of segmentation and combination of time-series data and Section 3 presents an empirical evaluation of the algorithm on a diverse set of time-series classification tasks. Further details about this study may be found in [4].

2 Segment and Combine

Notations

A time-series is originally represented as a discrete time finite duration real-valued vector signal. The different components of the vector signal are called temporal attributes in what follows. The number of time-steps for a given temporal attribute is called its duration. We suppose that all temporal attributes of a given time-series have the same duration. On the other hand, the durations of different time-series of a given problem (or dataset) are not assumed to be identical. A given time series is related to a particular observation (or object). A learning sample (or dataset) is a set (ordering is considered irrelevant at this level) of N preclassified time-series denoted by LSN = {(




Publication date: 2005